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Abstract 

Simulating sample correlation matrices is important in many areas of statistics. 
Approaches such as generating normal data and finding their sample correlation ma- 
trix or generating random uniform [—1,1] deviates as pairwise correlations both have 
drawbacks. We develop an algorithm for adding noise, in a highly controlled manner, 
to general correlation matrices. In many instances, our method yields results which are 
superior to those obtained by simply simulating normal data. Moreover, we demon- 
strate how our general algorithm can be tailored to a number of different correlation 
models. Finally, using our results with an existing clustering algorithm, we show that 
simulating correlation matrices can help assess statistical methodology. 



1 Introduction 



As computational resources continue to improve, researchers can take advantage of 
simulation studies to investigate properties and results associated with novel statisti- 
cal methodology. In particular, simulating correlation matrices with or without a given 
structure can provide insight into the sensitivity of a model. There has been exten- 
sive work on simulating correlation matrices with random entries; that is, generating 
positive semidefinite matrices with all entries bounded by [—1, 1] and having ones on 



the diagonal. Seminal work by Marsaglia and Olkin 30 discusses distributional char- 
acteristics and eigenvalues of simulated random correlation matrices. Although there 
has been additional work expanding the ideas associated with generating random cor- 
relation matrices [8 , 16 ,21 28 35 and even randomly generating correlation matrices 
within particular settings 15 33 , to our knowledge there is no literature devoted to 
the problem of adding noise to given correlation structures. 

We discuss the need to simulate realistic correlation matrices in a specific context. 
By realistic we mean not only that the correlation matrix has some prescribed struc- 
ture (dependent upon the requirements of the particular application), but also that 
it is noisy. Simulating correlation matrices is important for a variety of situations. 
For example, methods that use simulated correlation matrices include probit anal- 
ysis 29 44 , shrinkage estimation |2 
management science (32 



meta-analysis [101, multiple comparisons 23 



and factor analysis 17 , as well as clustering. Some clus- 



tering and classification methods simulate correlations (or covariances) using uniform 
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distributions [20 24 25 38 . However, unconstrained randomly simulated correlation 
matrices are typically not positive semidefinite. Although we provide motivation and 
a detailed example using clustering, our method is applicable to any context where 
simulating realistic correlation matrices is important. 

There is an enormous literature devoted to methods designed for unsupervised 
clustering. In particular, the clustering of microarray data has generated much recent 
interest. Clustering techniques can be quite powerful because they provide models 
or novel groupings in a setting where class labels are unknown. However, the lack 
of knowledge of underlying structure makes assessing clustering algorithms difficult 
in realistic settings. Indeed, there has been a call for work that seeks to determine 
the reproducibility of clustering methods instead of continued work on developing new 
clustering algorithms [I 37 . 

Suppose that we are given a N x N correlation matrix S = (S^)^- =1 . Generating a 
noisy correlation matrix S = (Sij)fj =1 based upon the template S can be difficult since 
noise must be added to S in such a way that S is positive semidefinite, and satisfies 
Sa = 1 and — 1 < Sij < 1 for 1 < i,j < N. Moreover, for numerical purposes (e.g., 
generating data from S) one might also require an explicit upper bound on the condition 
number of S to ensure its numerical stability (e.g., for matrix inversion). Unfortunately, 
naively adding random noise to a correlation matrix can result in matrices which violate 
any or all of the above constraints. 

1.1 Simulating Data for Evaluating Algorithms 



Many clustering algorithms (e.g., hierarchical clustering, PAM 22 , HOPACH 40 
QT clustering [l4], WGCNA [43]) use distances in lieu of raw data to partition or 
agglomerate observations into groups. Additionally, correlation distance has been used 
recently for high dimensional data reduction methods in order to apply functional 
data analysis |5|. A method for determining differential co-expression based only on 
correlation (and not raw data) is given by Choi and Kendziorski [fjj. In fact, correlation 
has become a standard method for determining similarity between high throughput 
observations; distances are computed from one minus the correlation or one minus the 
absolute correlation. With most high throughput data, the relationships of interest are 
due to directional changes across experimental conditions (measured by correlation) as 
opposed to magnitude changes across experimental conditions (measured by Euclidean 
distance). 

In order to assess the accuracy of a clustering algorithm, it is important to have 
a known group structure, many authors use a certain group structure and simulate 
normal data from that matrix. For example, in a recent paper, Tritchler et al. simulate 
normal data to assess a method for filtering genes prior to clustering and network 
analysis. Their clustering structure consists of within cluster correlations of 0.4 and 
between cluster correlations of |39|. Using clustering to find differentially expressed 
genes, Hu et al. generate normal deviates in a two-cluster setting with one cluster of 
100 observations correlated at 0.94, another cluster of 608 observations clustered at 
0.9, and observations from different clusters correlated at 0.697 [19] . 

We appreciate the difficulty in generating realistic data with known cluster struc- 



ture. However, we believe that using normal deviates often adds an additional unnec- 



essary layer of assumptions. In Section 4.1 we demonstrate that our method produces 



matrices that are more general than the class of matrices produced by finding the sam- 
ple correlation of normally distributed data. Indeed, we do not believe that microarray 
data are normally distributed [13]. Instead of simulating normal data from a known 
correlation structure, we argue in favor of simulating correlation matrices directly based 
on a known correlation structure. The random correlation matrices can then be used 
to assess the algorithm at hand. 

Typically, the r/th entry Sij of a N x N correlation matrix S = ($y)fj=i is the 
Pearson correlation between the ith and j'th objects. In particular, S is symmetric, 
each entry lies between —1 and 1, and the diagonal entries of S are all equal to 1. 
Correlation matrices with such a structure can be constructed from any correlations 
computed on the pairs of entries associated with a particular row and column (e.g., 
Spearman correlation or biweight correlation In general, any positive semidef- 



inite matrix is a covariance matrix 31 , p. 3] and any such matrix corresponds to a 
correlation matrix. One approach for generating random correlation matrices draws 
independent observations from a uniform distribution on [—1, 1] as the entries of the 



82 correlation matrix 24 25 . Random entries of the correlation matrix will not necessar- 

83 ily create positive definite matrices, and any further processes based on inverting the 

84 correlation matrix will be impossible. The correlation matrix can provide a tremendous 

85 amount of information about the underlying data structure, but due to the required 

86 algebraic relations, generating a noisy correlation matrix which satisfies certain pre- 

87 scribed conditions is not trivial. 

as 1.2 Three existing models 

89 The goal of our work is to provide an algorithm for simulating correlation structures 

90 that: 

91 • are realistic yet flexible in structure, 

92 • contain sufficient noise for within group correlations, 

93 • contain sufficient noise for between group correlations, 

94 • can be used in lieu of simulating (Gaussian) data from the correlation matrix. 

95 Again, to motivate the properties of the simulated correlation matrix, we point out 

96 (1) microarray data are typically not normally distributed, (2) many clustering and 

97 classification methods rely only on the distances (i.e., correlations) between observa- 

98 tions and not their observed values, (3) it is important to build in possible structure 

99 or hierarchical relationships across different values. 

100 It has been observed, though (to our knowledge) not investigated, that with high- 

101 throughput data, the dependencies between clusters are weak but existent. As Guo 

102 et al. declare, "[although it is true that weak connections between groups may exist, 

103 independence between groups is usually a reasonable assumption" M. 

104 Below we have outlined three methods for generating correlation matrices, each of 

105 which describes different dependence structures for simulating clusters of data. Our 
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106 paper offers a flexible way to generate correlation structures given any reasonable model 

107 of what we would expect across observational units. 

108 1.2.1 Constant correlation model 

109 A standard clustering structure is one with fixed correlations within a cluster and 
no between clusters (often with correlation equal to zero between clusters). Hu et al. |19| 
in simulate observations correlated at the same level with off diagonal noise also specified. 

' 0.94 1 < i < j < 100, 
Sy = < 0.697 for 1 < i < 100 < j < 708, 
k 0.90 for 101 < i < j < 708, 

us 1.2.2 Toeplitz model 

114 Another structure is one that models high correlation for observations which are close 

115 together in the correlation matrix and such that correlation values decrease for observa- 

116 tions which are increasingly far away. In building a classification model, Guo et al. [9] 

117 describe a Toeplitz structure (sometimes referred to as an auto-regressive structure) to 
us the correlation matrix, where adjacent pairs of observations are highly correlated, and 

119 those further away are less correlated. Let 

( 1 Pk p\ pI ■■■ pT~ 1 \ 

Pk 1 Pk P k Pk 

2 i Tit.— 3 

Pk Pk 1 Pk ■■■ Pk 

^k — 3 2 i rii,— 4 (1) 

Pk Pk Pk 1 Pk 

\Pk Pk Pk Pk 1 / 

120 be the correlation matrix for fcth cluster, given by the correlation value pk- In this 

121 model, the between group correlations are set to zero. Additional classification models 

122 have used similar Toeplitz structure for simulating data from a correlation matrix 

123 [7|[20j[34j|4lJ|45]. In fact, Huang et al. use a t/[0.5,1.5] distribution to simulate the 

124 variance components of the population in order add noise to the above prescribed 

125 structure. 

126 1.2.3 Hub observation model 

127 The last model is one that is hierarchical in nature based on a single hub-observation 

128 and the relationship of each observation to that original hub. Horvath et al. (26, 27,43] 

129 define a correlation structure with respect to a particular profile (or hub-observation). 

130 Each observation in the cluster is correlated with the hub-observation with decreasing 

131 strength (from a user supplied maximum correlation to a given minimum correlation). 



4 



132 Additionally, clusters are generated independently (that is, with correlation zero be- 

133 tween clusters). For the ith observation (i = 2, 3, ... n), the correlation between it and 

134 the hub-observation is given by, S = (£jj)^- =1 : 

£i,hub = Pmax - ((« - 2)/ (n - 2)) 7 ( / o max - p min ) 



135 Note that the correlation between the ith observation and the hub will range from p max 

136 to pmi n ; the rate at which the correlations decay is controlled by the power, 7. 

137 Motivated by the three existing models above, we provide an algorithm for adding 

138 noise to correlation matrices. Section [2] contains the theoretical details behind our pro- 

139 cedure, while Section [3] concerns several specific applications in the contexts discussed 

140 above. In Section |4j we give an example of how simulated correlation matrices with 

141 realistic structure can be used to assess a statistical method. 



142 2 The Basic Procedure 

143 2.1 Preliminaries 

Recall that if A is a N x N symmetric matrix, then each of its eigenvalues is real and 
hence we may list them in descending order 

\x(A)>\ 2 (A)>--->\ N (A), 

144 where each eigenvalue is repeated according to its multiplicity. According to this 

145 convention, A is positive semi-definite if and only if An (A) > and A is positive 

146 definite if and only if An (A) > 0. 

147 The norm of a N x N matrix A is defined to be 

\\A\\ = max \\Av\\, (2) 

||v||=l" 

which equals \i(A) if A is positive semi-definite. To be more specific, the expression 
([2]) is often called the operator norm to distinguish it from other frequently used matrix 
norms (e.g., the Frobenius norm). The condition number |18[ p. 336] of a symmetric 
matrix A is defined to be 



«(A) 




if A is nonsingular, 
if A is singular. 



In particular, if A is positive semi-definite, then we have 



k(A) 



ifAiv(A)>0, 
00 ifAjv(A) = 0. 



148 In the following, we let I n denote the n x n identity matrix and l n denote the n x n 

149 matrix whose entries are all equal to 1. 
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2.2 An algorithm 



Given a prototype correlation matrix E = (Sjj) • J=1 , we might wish to add noise to E 
in a computationally efficient way such that the resulting matrix S is also a correlation 
matrix. Furthermore, we might also require effective bounds on the condition number 
k(S) of S to ensure that S is a suitable candidate for certain numerical procedures (e.g., 
matrix inversion). For example, in the statistical software R, the default tolerance for 
detecting linear dependencies in the columns of a matrix is a condition number < 10 15 . 
The following simple procedure, which we shall apply in more specialized settings 
(Section [3]), accomplishes this task. 

Algorithm 1. Let 

1. E be a given N x N correlation matrix, 

2. < e < Atv(E) (e is the maximum noise level), 

3. M be a positive integer (the dimension of the noise space). 

Select N unit vectors ui, 112, . . . , ujv from M. M and form the M x N matrix U = 
(ui|u.2| • • • |u n ) whose columns are the Uj. The N x N matrix 



S 1 = E + e(U T U — I) 



(3) 



165 
166 



is a correlation matrix whose entries satisfy \Sij 
condition number n(S) satisfies 



Ejj| < £ for 1 < i,j < N and whose 



167 
168 
169 
170 
171 
172 



174 
175 



k(S) < 



Ai(E) + (iV-l)e 
Aat(E)-£ 



(4) 



Before justifying the procedure above, let us make a few remarks concerning the 
condition number of S. It might be the case that a certain numerical procedure (e.g., 
matrix inversion in R) cannot handle a matrix which is poorly conditioned (i.e., with a 
large condition number). Therefore we might desire that k(S) < K max for some fixed 
ftmax, which depends upon the particular requirements of the software being employed. 
From Q, it is easy to see that any e > satisfying the additional constraint 



e < 



^maxAjy(E) - Aj(E) 
Kmax + (N - 1) 



(5) 



yields an S such that n(S) < K max - 

Justification of Algorithm 1. Let E = U T U so that 



E 



( 1 

T 

u 2 u l 



T 
U{ U 2 

1 



\u^ui u^u 2 



ufujv\ 
ufu/v 



1 / 



and note that E is symmetric and positive-semidefinite (i.e., An(E) > 0). Moreover, 
E is positive definite if and only if the Uj are linearly independent |18[ Thm. 7.2.10]. 
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180 
181 
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184 
185 
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Now recall that Gersgorin's Disk Theorem 18, Thm. 6.1.1] asserts that if A = (A 



is a N x N matrix, then for each eigenvalue A of A there exists a corresponding index 
i such that 



A' 



\X-Au\ <^2\A 



3^1 



By Gersgorin's theorem and Cauchy-Schwarz, it follows that every eigenvalue A of E 
satisfies 



|A-l|<^|uf Uj |<(iV_i) 



3=1 



whence < \i{E) < N for % = 1, 2, . . . , N. 

We next define S by ^ and observe that S is of the form 



S 



( 1 S12 + euf u 2 

E 2 i + euTui 1 



S 2 AT + eur^ Ujv 

1 



(6) 



In particular, S is our original matrix S with "noise" terms eu*Uj of magnitude at 
most e added to the off-diagonal entries. To analyze the impact of adding this noise, 
we require Weyl's Inequalities [18[ Thm. 4.3.1], which assert that if A and B are N x N 
symmetric matrices, then 



X j {A) + X N (B) < \j(A + B) < X j (A) + X 1 (B) 



(7) 



for j = 1,2, . . . , N. Applying the lower inequality in Q with j = N, A = S — eliv, 
and B = eE, we obtain 

< A7v(S)-e = X n (E-eI n ) < X N (Z - eI n ) + X n (eE) < X N (S) 

from which we conclude that S is positive definite. Next, we apply the upper inequality 
in ([7]) with j = 1 which yields 

Xi{S) < Ai(E - el) + Xi(eE) < (Ai(S) - e) + Ne = Ai(E) + (iV - l)e. 

Putting this all together, we obtain the estimates 

< X N (Z)-e < X N (S) < Ai(5) < Ai(E) + (N- l)e. 

The inequality Q follows since k(S') = Ai(S , )/A J v(5'). □ 

There are a few arguments that can be made in favor of adding noise in this manner. 
First of all, the procedure described above is easy to implement numerically, and it can 
be rapidly executed. Moreover, it offers a great deal of flexibility since the dimension 
M of the ambient space that the vectors m, 112, . . . , ujy are drawn from and the manner 
in which these vectors are selected is arbitrary and can be tailored to the particular 
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188 application at hand. Finally, our method is completely general in the sense that any 

189 positive-definite N x N matrix E having constant diagonal 1 can be factored as E = 

190 U T U where U is some matrix whose columns are unit vectors (e.g., let U be the positive- 

191 definite square root of E). In other words, regardless of the method one employs to 

192 produce a positive-semidefinite matrix E = U T U for use in ([3]), the same E can in 

193 principle be generated using our approach. 

194 Let us now say a few words about the manner in which the vectors Uj are selected. 

195 If M is very small (e.g., 2 < M < 5), then many of the dot products ufuj will be 

196 relatively large in magnitude. For many purposes, this yields a very noisy coefficient 

197 matrix S based upon the original template S. Moreover, even if M is relatively large, 

198 then the matrix E = U T U can be computed extremely rapidly since generating the 

199 unit vectors u, and computing the dot products ufuj involve relatively straightforward 

200 computations (e.g., no eigenvalue calculations). 

201 There are of course many other ways which one could select the Uj. If one wishes 

202 the ujiij to be consistently large in magnitude while also ensuring that E has full 

203 rank, one lets M > N and then selects numbers a\, (%2, ■ ■ ■ , ceisr at random from [—1,1] 

204 using a continuous probability density function f{x) on [—1, 1] which favors extreme 

205 values (e.g., f{x) = \x\, f(x) = 2 ^ 2 /_ 1 7 f x ' 2 -, or a Beta distribution transformed to exist 

206 on the range [—1,1]). One then replaces the numbers u^Uj in Q by 



aittj + N 2 )(l - KP)uf Uj. (8) 

207 In effect, one is replacing the Uj 6 R M with the unit vectors (ctj, -^/l — |aj| 2 Uj) G ]R M+1 . 

208 These vectors tend to be highly correlated (but linearly independent) since the numbers 

209 azi favor extreme values in the interval [—1,1]. 

210 3 Recipes 

211 Using the basic procedure (Algorithm [T]) described above for adding noise to a given 

212 correlation matrix, we can modify our approach to take advantage of certain known 

213 structures. For the constant correlation structure (Subsection |3.1[ ) and Toeplitz struc- 

214 ture (Subsection 3.2) we first calculate sharp bounds on the smallest eigenvalues of the 

215 template correlation matrix S and then apply Algorithm [l] to obtain the noisy corre- 

216 lation matrix S. A similar computational approach to the hub correlation structure is 

217 discussed in Subsection |3.3| Each model is expanded to describe a population based on 

218 multiple clusters with the same underlying structure (and different sizes and parameter 

219 values). 

220 3.1 Constant correlation structure 

221 The first correlation structure is based on constant correlations within each cluster 

222 and between each cluster (values of the correlation differ for each relationship). In 

223 particular, observe that the approach below yields a noisy correlation matrix which 

224 has a significant amount of noise on the off-diagonal blocks. We argue that this is 

225 more realistic than simply assuming that all of these entries are zero. 
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Algorithm 2. Let 

• K denote a positive integer (the number of clusters) and k = 1, 2 . . . , K , 

• nk be a positive integer (the size of the kth cluster), 

• N = J2k=i n k (size of the desired matrix), 

• pk such that < pk < 1 (baseline correlation in the kth cluster), 

• Pmin = P2, • • • , Pk} (minimum correlation in any cluster), 

• Pmax = max{pi,p2, . . . , Pk} (maximum correlation in any cluster), 

• 5 such that < 5 < p m \ a (baseline noise between clusters), 

• Sfc be the x matrix 

( 1 Pk ■■■ Pk\ 
Pk 1 ■■■ Pk 



(9) 



\Pk Pk ■■■ 1 / 

(correlation matrix for kth cluster), 

• S = (Tiij)fj =l be the N x N matrix having the blocks along the diagonal and 
0s elsewhere, 

• e such that < e < 1 — p max (maximum entry-wise random noise), 

• M be a positive integer (the dimension of the noise space). 

Select N unit vectors m, 112, . . . , ujv from R M . The N x N matrix S = (5y)^ =1 
defined by 



■>ij 



1 



*/* = 3, 



Pk + eu^Uj i/i, j are in the kth cluster and i 7^ j, 



(10) 



<5 + eu^Uj ifi,j are in different clusters, 
is a correlation matrix whose condition number satisfies 



N(l + e) + l 

1 Pmax £ 



(11) 



Justification of Algorithm 2. In order to introduce a significant amount of noise to the 
off-diagonal blocks, we work instead with the modified correlation matrix 



/Ei - 51 



S' 



E 2 - 51 



n-2 



\ 



SlnJ 



-51 



N, 



(12) 
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where t n denotes the n x n matrix whose entries are all 1. Since 



it follows that 

Aj(£fc — St nk 




(1 - p k )In k + {Pk ~ S)l nk , 

-S) + (1- Pk 



if i 

if J 



1, 
2,3, 



(13) 



and that the eigenspace corresponding to the largest eigenvalue of — 5t nk is spanned 
by the vector l„ fc = (1, 1, . . . , 1) € M nfe . In particular, the eigenspace corresponding to 
the eigenvalue 1 — pk is (n k — l)-dimensional and any eigenvector v = (vi,V2, ■ ■ ■ , v Uk ) 



belonging to this eigenspace is orthogonal to l nk (i.e., satisfies ^ 



Tit 

i=l V i 



0). 



If we augment v by placing N — n k zeros appropriately, we obtain a vector 
v / = (0,0,...,0 J ui ) u 2) ...,« njk) 0,0,...,0)Gl JV 



niH hrifc-i 



«fc+iH Vn K 



which is an eigenvector of Yl corresponding to the eigenvalue 1 — pk since Av' = 
(1 — /3fc)v' and Iatv' = 0. It follows that the lowest N — K eigenvalues of £ are the 
numbers 1 — pk, each repeated nk — 1 times. In particular, 

Atv(S') = 1 — p m ax- 

An upper bound on the eigenvalues of X follows from ([7]) and (13): 

Ai(S') < Ai(A) + Ai(<51jv) 

< max {n k (p k - 5) + (1 - p k )} + NS 

l<k<K 

<N(l-5) + l + N5 
= N + l. 

Plugging the matrix £' into Algorithm [l] and using the preceding estimates for Ai(E') 
and Ajv(E') into Q we obtain the desired estimate (11) for k(S). □ 

3.2 Toeplitz correlation structure 

The Toeplitz structure assumes that as observations are farther away, they are less 
correlated. The Toeplitz structure has been used extensively in classification and dis- 
criminant analysis as a model for group correlations 7,9| [34|[4H|45| . In particular, the 
model assumes that each pair of adjacent observations is highly correlated and that 
the correlations between the ith and jth observations decay exponentially with respect 
to \i-j\. 

Let < p < 1 and consider the n x n Toeplitz matrix 



( 1 


p 


P 2 


P 3 ■ 




p 


1 


P 


P 2 ■ 


• p n - 2 


P 2 


p 


1 


P 


• p n - 3 


P 3 


P 2 


P 


1 


■ p n ~ A 


W 1 - 1 


p n-2 


p n-3 




■ 1 / 



(14) 
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whose ijth entry is p\ l J 'L Using the spectral theory of self-adjoint Toeplitz operators, 
it is possible to show that T n is positive-definite and that its eigenvalues satisfy 



1 



^ < Mr„) < \±£ 



i+,---<™-i-, < 15) 

for j = 1,2, ... ,n (see Appendix [A] for details). We also remark that the preceding 
bounds are quite sharp in the sense that 



lim \i(T n ) 



1 + P 
\-p 



lim A„(T n 



1 + 



(16) 



as the size n of the matrix tends to infinity. In light of the explicit bounds (15), a 
straightforward application of Algorithm [T] yields the following procedure. 



Algorithm 3. Let 

• K denote a positive integer (the number of clusters) and k = 1, 2 . 

• nk be a positive integer (the size of the ith cluster), 

• N = ~Y^ = i rik (size of the desired matrix), 

• pk be such that < p^ < 1 (correlation factor in the kth cluster), 

• Pmax = max{pi, p2, . . . , Pk} (maximum correlation factor), 

• Sfc be the nj- x Toeplitz correlation matrix 



( 1 


Pk 


Pi 


pi • 


nt- 




Pk 


1 


Pk 


^ • 


nt- 
• Pfc 


-2 


Pi 




1 






-3 


Pi 


Pi 


Pk 


i 


lit- 
• Pk 


-4 


vr 1 


ni,—2 
Pk 


rih— 3 


iiu- 4 


1 


/ 



(17) 



(correlation matrix for kth cluster), 

E = (Ejj)^ =1 6e £/ie N x N matrix having blocks E^ along the diagonal, 
I- Pxi 



< e < 



l + p E 



(maximum entry-wise random noise), 



• M be a positive integer (the dimension of the noise space). 

Select N unit vectors ui, 112, . . . , u/v /rom M Af and /orm i/ie M x N matrix U 
(ui|ii2| • • • |u n ) whose columns are the Uj. The N x N matrix 

S = Y, + e(U T U — I) 



(18) 



is a correlation matrix whose entries satisfy \Sij 
satisfies 



- Ey I < £ and whose condition number 



< 



l+Pn 

1— Pn 



(N-l)s 



1 — Pn 
l+Pn 



(19) 
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Among other things, let us remark that for typical values of p (e.g., Guo, Hastie, and 
Tibshirani let p = 0.9 in [9] ) the noise level e can be made quite large compared to most 



of the entries in each This occurs because the eigenvalue estimates (15) are strong 
and because the off-diagonal entries of each are small (due to exponential decay) 
if one is far away from the main diagonal. Thus the approach outlined above yields a 
flexible method for introducing noise into the Toeplitz model. In fact, one can introduce 
so much noise (while still obtaining a correlation matrix with controlled condition 
number) that the original block- Toeplitz structure becomes difficult to discern. 

3.3 Hub correlation structure 

The hub correlation structure assumes a known correlation between a hub observation 
(typically the first observation) and each of the other observations. Moreover, one 
typically assumes that the correlation between the 1st and the ith observation decays 
as % increases. 

Let us describe a typical example that has been considered frequently in the liter- 
ature. Suppose that the first row (and hence column) of a n x n correlation matrix A 
is to consist of the prescribed values 

-All = 1) -^li = Pmax — {PmsLX ~ Pmii 



n 



which decreases (linearly if 7 = 1) from Ayi = p max to A\ n = p m i n for 2 < i < n. 



For instance, this model is considered in Horvath et al. [26 27 43]. For the sake of 
simplicity, we consider the linear case 7 = 1 and adopt a more convenient notation. 
Rather than specifying p max and p m in, we specify only p max and work instead with the 
step size r = (p max - p min )/(n - 2). 

There are a variety of ways to generate the remainder of such a correlation matrix. 
Using any hub structure correlation matrix, we can find the smallest eigenvalue which 
gives a bound on the noise needed in Algorithm [T] For example, we can use a Toeplitz 
structure to fill out the remainder of the hub correlation matrix and, using the well- 
developed theory of truncated Toeplitz matrices [3j[4] , obtain strong eigenvalue bounds 
which can be fed directly into Algorithm [T] directly. 

Algorithm 4. Let 

• K denote a positive integer (the number of clusters) and k = 1, 2 . . . , K , 

• nfc be a positive integer (the size of the ith cluster), 

• N = Y2k=i n k ( s ^ ze °f the desired matrix), 

• pk (maximum correlation in the first row of kth cluster), 

• 5k (step size in first row/column of kth cluster), 

• CKfe.i = 1 and otk,i = Pk — T~k(i — 2) (correlations between hub and observations), 
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Efc be the n k x n k hub-Toeplitz correlation matrix 



/ 1 


Ctk,2 




«fc,4 




&k,2 


1 




«fc,3 


• a k,n k -l 






1 


«fc,2 


" a k,n k -2 




Ofc,3 




1 


C^fc,7ife — 3 


\ a k,n k 


^fc,nfc — 1 


a fc,n fe -2 




1 / 



(20) 
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(correlation matrix for kth cluster), 

• E = (Sjj)^ =1 6e i/ie N x N matrix having blocks E^ along the diagonal, 

• < e < min jl — p k — §Tfc : 1 < A; < /T} (e is the maximum noise level), 

• M be a positive integer (the dimension of the noise space). 

Select N unit vectors ui, U2, . . . , ujy from M A/ and form the M x N matrix U 
(ui|u2| • • • |u n ) whose columns are the Uj. The N x N matrix 

S = T, + e(U T U — I) 



21) 



is a correlation matrix whose entries satisfy \Sij — Ey| < e and whose condition number 



satisfies the bound Q wh- 



ere 



Ai(E) < max<j 1 + {n k - l)p k - Tfc K 2 )K jj . j < fc < ^ ^ (22) 



(23) 



Aat(S) > min ^1 — p k — -r k : 1 < < if 



Justification of Algorithm^ By Gersgorin's Disk Theorem 18, Thm. 6.1.1], the largest 
eigenvalue Ai(Efc) of E^ satisfies 

Ai(E fc ) < 1 + p k + (pk ~ r k ) H h (/Jfc - (njs, - 2)r fc ) 

= 1 + (n k - l)Pk - r k • 



319 
320 



321 
322 
323 
324 
325 
326 



This immediately yields ( 22 ) . On the other hand, it is possible to show that the smallest 
eigenvalue of E& satisfies 

A nfc (E fc )>l- p k -^T k . (24) 

To be brief, one regards the original n k x n k Toeplitz matrix as the upper-left 
principal submatrix of a (2n k — l) x (2n k — 1) symmetric circulant matrix, the eigenvalues 
of which can be exactly computed using well-known techniques [3j p. 32]. A series of 
algebraic manipulations and a standard eigenvalue interlacing result [3j Thm. 9.19] 
yields the desired inequality (24), from which (23) follows. We thank A. Bottcher, the 
author of [3~1|4], for pointing this out to us. □ 
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327 3.4 Extensions 

328 Before proceeding, let us remark that Algorithm [T] is applicable to any given positive 

329 definite correlation matrix. The amount of noise which can be added to the original 

330 matrix is determined by its smallest eigenvalue. As we have seen, for several specific 

331 classes of correlation matrices, one can obtain simple, but powerful, lower bounds on 

332 this lowest eigenvalue. For such correlation matrices, we have provided explicit, spe- 

333 cialized algorithms which provide a significant amount of noise while also maintaining 

334 quantitative control over the condition number of the resulting matrix. 



335 4 Results 

336 4.1 Simulated Data 

337 One method for generating a noisy correlation matrix is to simulate normal data from 

338 an original template and find the sample correlation matrix from the data. Varying the 

339 sample size of the generated data can create correlation matrices which are more or less 

340 variable (in magnitude). However, the nature of the variability (distribution) is similar 

341 across different sample sizes. In particular, the majority of the entries in a given sample 

342 correlation matrix generated from normal data are quite close to the template matrix. 

343 Only a handful of observations deviate from the template substantially. Additionally, 

344 the sample size needed in order to get a large amount of variability could be smaller than 

345 the dimension of the correlation matrix (thus producing sample correlation matrices 

346 that are not positive definite). 

347 To demonstrate the restriction associated with simulating normal data as a way to 

348 find sample correlation matrices, we generate multiple correlation matrices using both 

349 normal samples and our method. The three normal structures (sample sizes 25, 250, 

350 1000) show the same tendencies with more spread for smaller sample sizes. The three 

351 simulations using our method are based on uniform random vectors in M 2 (GH1) and 

352 IR 25 (GH3) as well as uniform random vectors in R 2 with an additional a component 

353 from a Beta(l,2) distribution as in equation ([8]) (GH2). For each simulation we used 

354 a constant correlation structure with three groups comprised of sample sizes of (n\ = 

355 100, ri2 = 50, ri3 = 80), within cluster correlations of (pi = 0.7, p2 = 0.7, p3 = 0.4), and 

356 between cluster correlations of 5 = 0.25. 

357 Figure [T] shows the spread of entry- wise differences between the sample correlation 

358 matrices and the template for the six simulation scenarios. We see that the Garcia- 

359 Hardin (GH) scenarios are able to add larger noise terms than the normal simulation. 

360 Figure [2] shows the distribution of differences. Depending on the application, one 

361 might prefer large noise components, uniform noise component, or bell-shaped noise 

362 components. Each of those structures is easily generated by the algorithms outlined in 

363 the paper. 

364 To demonstrate the effectiveness of our method, we simulate data from three mod- 

365 els to show that noise added to a known structure can help to evaluate a clustering 

366 algorithm. For illustrative purposes, we use the PAM algorithm 122] to cluster and the 



367 adjusted Rand statistic 36,42 to measure the degree of concordance between the clus- 
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template p 1 = 0.7 p 2 = 0.7 p 3 = 0.4 5 = 0.25 



Ei\\ =o 



GHl p~ x = 0.7 p 2 = 0.7 p 3 = 0.4 5 = 0.25 ^ 6 IR 2 , [je^ = 0.29 

GH2 pi = 0.7 p 2 = 0.7 ^3 = 0.4 5 = 0.25 e, G IR 2 , = 0.29, a ~ Beta(l, 2) 

GH3 pi = 0.7 p 2 = 0-7 p 3 = 0.4 J = 0.25 e, G R 25 , [jgjjj = 0.29 

Norm25 Pl = 0.7 p 2 = 0.7 p 3 = 0.4 5 = 0.25 25 vectors 

Norm250 p x = 0.7 p 2 = 0.7 p 3 = 0.4 5 = 0.25 250 vectors 

NormlOOO p x = 0.7 p 2 = 0.7 p 3 = 0.4 5 = 0.25 1000 vectors 



Table 1: Six different correlation matrix generating scenarios. GHl, GH2, and GH3 use the algorithms 
given in the paper for constant correlation. The normal simulations use the template matrix with the given 
sample size of random vectors. Each correlation matrix is based on a setting of 3 clusters with sizes (100, 
50, 80). 



368 
369 
370 
371 
372 
373 
374 
375 
376 
377 



tering output and the truth. Using silhouette width, the unsupervised PAM algorithm 
will give the optimal number of clusters. The adjusted Rand statistics models the 
degree of concordance between the PAM results and the truth. An adjusted Rand of 1 
indicates perfect concordance; an adjusted Rand of zero indicates a random partition. 

For each of the models we tested, we created the correlation matrix (including 
noise) using the appropriate algorithm. For constant correlation, we used Algorithm 
[2] to obtain a matrix of the form 10 Algorithm [3] was used to obtain a matrix with 
a Toeplitz structure as in equation (18). We used Algorithm [T] for adding noise to a 
general structure, and the adjustment of the algorithm to simulate the hub- Toeplitz 



structure as in equation (21). 



4.2 Constant Correlation 

To assess simulating the constant correlation using the parameter settings below we 
use Algorithm [2] All simulations were done in the 3 cluster setting with sample sizes 



= 100, 
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379 Clustering Results 

380 For each of the scenarios, we simulated 1000 correlation matrices. We then clustered 

381 the data using PAM; the clustering results were assessed by determining the number 

382 of clusters the algorithm produced (truth was 3 clusters) as well as the concordance 

383 between the clustering results and the truth (1 gives perfect concordance). 
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Figure 1: Each boxplot represents the distribution of the entry- wise differences between the generated 
correlation matrix and the template structure. 



Scenario 


CC1 


CC2 


CC3 


CC4 


min # clusters 


3 


7 


2 


3 


median # clusters 


3 


12 


11 


3 


max # clusters 


3 


17 


15 


3 


median adj Rand 


0.950 


0.303 


0.300 


1 



Table 2: Results from optimal number of clusters as well as the adjusted Rand. The original correlation 
structure had 3 clusters. A perfect allocation of points gives an adjusted Rand of 1. 



384 Our results show that adding noise can create scenarios about which the algorithm 

385 is unable to determine the true structure (CC2 and CC3) and scenarios where the noise 

386 is not sufficient to decrease the performance of the algorithm (CC1 and CC4). CC1 

387 and CC2 had reasonably low between cluster correlation but a high bound on the noise. 

388 CC3 and CC4 had high between cluster correlation and a low bound on the noise. In 
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HG1; dim R A 2 



HG2; dim R A 2, Beta(1,2) 
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Figure 2: Each histogram represents the distribution of entry- wise differences between the generated matrix 
and the template. The distribution of differences for GH3 is similar to the correlation matrix generated by 
sampling 250 random normal vectors. 
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CC1 we select the noise vectors from R 5 which creates noise which is generally small 
in magnitude. In additional simulations (not shown), large within cluster correlations 
tend to be consistent with perfect clustering results; small within cluster correlations 
give rise to poor clustering results. 
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393 4.3 Toeplitz Correlation 

394 The rapid off-diagonal decrease of the correlation values in the Toeplitz structure makes 

395 clustering difficult. Even with a base correlation of 0.9, observations 20 places away 

396 will be correlated at only 0.9 20 = 0.12. Our experience is that the PAM algorithm is 

397 only able to cluster Toeplitz structure data with quite small sample sizes or quite high 

398 correlations. However, we do recognize some structure to the results. Each final cluster 

399 has observations which are all correlated by at least 0.5. That is, the cluster sizes are 

400 determined by the base p value and not the original sample size. 

401 To assess simulating the Toeplitz correlation structure, we simulated three equal 

402 sized clusters with a given baseline p; all noise vectors were chosen randomly from IR 2 . 

403 As before, for each scenario listed below, we simulate 1000 correlation matrices and 

404 use Algorithm [3] for adding noise to the correlation matrix. The cluster size, baseline 

405 p and length of the noise vector (||ej||) are given in table [3j 



cluster size 


baseline p (| £j||) 


0.8 (0.1) 


0.9 (0.05) 


0.95 (0.025) 


0.99 (0.005) 


20 


□ 


A 


O 





50 


■ 


▲ 


• 


♦ 


100 


■ 


A 


• 





Table 3: The twelve different simulation settings and their respective symbols used in Figure^ 



406 Clustering Results 

407 Figure [3] gives the results of the Toeplitz simulation. Each dot represents the average 

408 over all 1000 simulations of the median cluster size (on a natural log scale) and the 

409 average over all 1000 simulations of the reported number of clusters. The vertical lines 

410 represent the minimum and maximum cluster size of all 1000 simulations (on a natural 

411 log scale). The cluster size depends primarily on the baseline p value, not the original 

412 cluster size. The horizontal line is the average of the cluster sizes over three simulation 

413 settings. Interestingly, the correlation cutoff for each cluster is reasonably constant. As 

414 seen in Table |4j the clustering results seem to be a function of the furthest observations 

415 being correlated by at least 0.5. Further analysis on this topic is suggested, but it is 

416 beyond the scope of the current work. 

417 The case of p = 0.99 correlation warrants special consideration. For small sample 

418 sizes, the PAM algorithm finds the correct allocation of points each time. For the 

419 simulation with 100 observations in each cluster, PAM finds six clusters instead of three 

420 (resulting in clusters of size 50). Note that with such a strong correlation, there is very 

421 little noise we can add while ensuring that the resulting matrix is a correlation matrix. 

422 Therefore, the simulations for the p = 0.99 correlation setting are unsurprisingly similar 

423 across the 1000 replicates. 
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o 20 pts per 

• 50 pis per 

• 100 pts per 

O 0.99 cor 

o 0.95 cor 

-a- 0.9 cor 

-a- 0.8 cor 

* + \ 



20 40 60 80 100 

estimated # clusters 

Figure 3: Each dot represents the average over all 1000 simulations of the median cluster size (on a natural 
log scale) and the average over all 1000 simulations of the number of clusters. The vertical lines represent 
the minimum and maximum cluster size of all 1000 simulations (on a natural log scale). The cluster size 
depends primarily on the baseline p value, not the original cluster size. The horizontal line is the average of 
the cluster sizes over three simulation settings. 



424 4.4 hub-Toeplitz Correlation 

425 To assess simulating the hub-Toeplitz correlation structure using the parameter settings 

426 below we use Algorithm [4} All simulations were done in the three cluster setting with 

427 sample sizes of (n\ = 100, n-i = 50, = 80). Recall that with the hub-Toeplitz 

428 correlation, the correlation values descend according to some power (here linearly) 

429 from a specified maximum to a specified minimum correlation. 
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430 Clustering Results 

431 For each of the scenarios, we simulated 1000 correlation matrices. We then clustered 

432 the data using PAM; the clustering results were assessed by determining the number of 

433 clusters the algorithm produced (truth was three clusters) as well as the concordance 

434 between the clustering results and the truth (1 gives perfect concordance). 

435 As with the constant correlation, using the hub-Toeplitz structure, we can add 

436 noise that keeps clustering structure intact or that obscures enough of the structure to 
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0.8 □ 0.9 A 0.95 O 0.99 

Average resulting cluster size a 3.003 4.971 12.704 50 6 

Minimum correlation for cluster 0.8 3 003 = 0.5 1 0.9 4 - 971 = 0.59 0.95 12 - 704 = 0.52 0.99 50 = 0.61 

a After finding the median cluster size for each simulation, the average is taken across different original 
cluster sizes and across the 1000 simulations. The average resulting cluster size is given by the horizontal 
lines in Figure [3J 

b Because PAM was able to find the correct clusters for small samples using p = 0.99, only the simulations 
with clusters of size 100 are reported for p = 0.99. 

Table 4: The resulting clusters are composed of objects that are correlated by no less than about 0.5 



Scenario 


hTCl hTC2 hTC3 hTC4 hTC5 


min # clusters 
median # clusters 
max # clusters 
median adj Rand 


3 3 3 3 3 
11 8 3 3 3 
20 13 3 3 3 
0.320 0.4142 1 1 1 



Table 5: Results from optimal number of clusters as well as the adjusted Rand. The original correlation 
structure had 3 clusters. A perfect allocation of points gives an adjusted Rand of 1. 



437 make clusters indistinguishable. Recall, in the hub-Toeplitz structure, the correlations 

438 within a cluster degrade linearly (as opposed to exponentially in the Toeplitz structure 

439 of Section [3]). For correlation structures that degrade all the way to zero (hTCl and 

440 hTC5), the algorithm is able to discern the structure if the original correlations are 

441 large (hTC5). For correlation structures that degrade only a small amount (hTC2, 

442 hTC3, hTC4), the results are based on the amount of error and the dimension from 

443 which the noise vectors are selected. 

444 5 Conclusion 

445 We have developed an algorithm for adding noise, in a highly controlled manner, to 

446 a template correlation matrix in order to obtain a more realistic correlation matrix. 

447 Moreover, we have demonstrated how our general procedure can be tailored to a number 

448 of different correlation models (e.g., constant correlation, Toeplitz structure). 

449 Our method allows for noisy correlation matrices which differ more from the initial 

450 template than the estimated correlation matrix based on simulated normal data. Us- 

451 ing normal data produces a sample correlation matrix with limited and well-behaved 

452 (possibly unrealistic) differences from the original template correlation if the generated 

453 sample is large. If the generated sample is small, then the sample correlation matrix is 

454 not positive definite (i.e., most of the eigenvalues will be zero). Using uniform [—1,1] 

455 deviates as random correlation values produces a matrix that is in general not even 

456 positive semidefinite. It can also create relationships between observations that are 

457 meaningless (e.g., a trio of observations where the first is highly correlated to the other 

458 two but the other two are negatively correlated). 
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459 Using a standard clustering algorithm, we have shown that simulated correlation 

460 matrices can be used to assess existing or novel statistical methodology. We provide 

461 the user with detailed algorithms to use on several standard clustering structures, as 

462 well as a general algorithm to apply to any correlation matrix for which the smallest 

463 eigenvalue can be reasonably estimated. 



A Appendix 



465 
466 
467 



In this section we justify the crucial inequalities (15) and the limits (16). First observe 



that the Toeplitz matrix T n from (14) is simply the upper-left corner of the infinite 
Toeplitz matrix 
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(25) 



which induces a linear operator T on the Hilbert space £ 2 of all square-summable 
infinite sequences. Since the ijth entry of T is the (i — j)th complex Fourier coefficient 



of the function P p {9) : [— ir, tt] 



defined by 



P H e in9 



1 — p cos 9 + p 2 ' 
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we conclude from [4, Thm. 1.9] that T is a bounded selfadjoint operator whose spectrum 
equals the range of P p |11, Pr. 250] (note that P p (6) is the so-called Poisson kernel 
from the study of harmonic functions). 



A short calculus exercise reveals that P p {9) achieves its maximum value at 
= and its minimum value at 9 = ±7r (see Figure 4) from which we conclude 




Figure 4: The Poisson kernel P p (0) for p = 0.2, 0.5, 0.8. As p ->• 1~, the graphs spike sharply at 6 = 
while tending rapidly to zero for 9 away from 0. Intuitively, the functions P p {0) approximate a point mass 
(i.e., Dirac (5-function) at 8 — as p — > 1~. 
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that the spectrum of T is precisely the closed interval [j^, j^,]- By [4, Prop. 2.17], 
it follows that the spectrum (i.e., the set of eigenvalues) of T n is also contained in 



this interval. This establishes the inequalities (15). The limiting behavior (16) follows 
immediately from [21 Thm. 5.14]. 
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